Introduction to unsupervised machine learning

Ben Lambert

Material covered today

  • what is meant by machine learning?
  • dimensionality reduction methods: PCA, t-SNE and UMAP
  • clustering

What is machine learning? (At two levels of difficulty.)

Level 1

Varieties (ignoring reinforcement learning)

Supervised: classification

Supervised: regression

Unsupervised: data

Unsupervised: example result

Level 1: summary

Machine learning comes in two varieties:

  • supervised learning:
    • typically lots of data-label pairs
    • aim is to build a model data -> label
    • categorical labels: classification
    • numeric labels: regression
  • unsupervised learning:
    • unlabelled data
    • goals are vaguer but generally aim to simplify data and uncover patterns

Level 2

How does a computer “see” a cat?

How many images are possible?

  • for a 20 x 20 binary image -> \(X\) has dimensionality of 400
  • \(2^{400}\approx 2 \times 10^{120}\) possible images
  • a very small proportion of those correspond to real world type images
  • a very small proportion of real world images correspond to cats
  • idea: even if dimensionality is big, effective dimensionality much lower
    • ML aims to find these lower dimensional representations
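The raw count of possible images is easy to verify directly; a quick sketch in Python:

```python
# Each of the 400 pixels in a 20 x 20 binary image is 0 or 1,
# so the number of distinct images is 2^400.
n_images = 2 ** 400
print(len(str(n_images)))  # 121 digits, i.e. on the order of 10^120
```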

Supervised learning

Rule determination

  • Want to learn a rule \(f: X \rightarrow y\)
  • Rule is a mathematical function controlled by low-dimensional parameters: \(f=f(\theta)\)
  • Have training data:

\[(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)\]

Can we learn \(f\) by optimising \(\theta\) on training data?
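As a minimal sketch of this idea (with hypothetical synthetic data), linear regression fits \(\theta\) by minimising squared error on the training pairs:

```python
import numpy as np

# Hypothetical training data generated from y = 2 + 3x plus noise;
# we "learn" f by estimating theta = (theta_0, theta_1) via least squares.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=50)

A = np.column_stack([np.ones_like(x), x])      # design matrix for f(x) = theta_0 + theta_1 x
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # optimise theta on the training data
print(np.round(theta, 1))                      # close to [2., 3.]
```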

Example rules

What is \(\; f\)?

  • Linear combination of elements of \(X\) (linear regression)
  • Linear combination of functions of elements of \(X\) (kernel regression)
  • Regression trees (random forests, boosted regression)
  • Non-linear combinations of elements, stacked into multiple layers (deep learning)

How to learn optimal parameters?

Unsupervised learning

Unsupervised learning: what does \(Z\) capture?

Unsupervised learning: clustering

Level 2: summary

  • ML algorithms take numeric objects (vectors / matrices / tensors) as input
  • intrinsic dimensionality of most things \(<\) raw dimensions: world simpler
  • supervised learning:
    • determines a mathematical function to predict outputs from inputs
    • function depends on parameters which must be learned using training / testing data
    • learning based on optimising cost function

  • unsupervised learning:
    • attempts to find more parsimonious representation of data
    • low dimensional variables learned may be more interpretable
    • clustering is an example of unsupervised ML

Questions?

Unsupervised learning

Flavours of unsupervised learning

  • dimensionality reduction
  • clustering (really a type of dim. reduction)
  • (outlier detection)

Dimensionality reduction

What is dimensionality reduction?

  • most real life things have lots of features
  • many features exhibit a degree of redundancy
  • the effective number of important features is lower and we aim to identify these

Why reduce dimensions?

  • universe is complex
  • science aims to understand constituent laws to simplify universe
  • more parsimonious theories have greater information compression and tend to generalise better

Why can dimensionality reduction help?

  • improves interpretability
  • aids visualisation
  • extracts core features for supervised learning
  • (lossy) information compression
  • noise reduction

Classes of dimensionality reduction

  • projection
  • manifold learning

Wine data

country designation points price province region_1 region_2 variety winery
US Martha’s Vineyard 96 235 California Napa Valley Napa Cabernet Sauvignon Heitz
Spain Carodorum Selección Especial Reserva 96 110 Northern Spain Toro Tinta de Toro Bodega Carmen Rodríguez
US Special Selected Late Harvest 96 90 California Knights Valley Sonoma Sauvignon Blanc Macauley
US Reserve 96 65 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
France La Brûlade 95 66 Provence Bandol Provence red blend Domaine de la Bégude
Spain Numanthia 95 73 Northern Spain Toro Tinta de Toro Numanthia
Spain San Román 95 65 Northern Spain Toro Tinta de Toro Maurodos
Spain Carodorum Único Crianza 95 110 Northern Spain Toro Tinta de Toro Bodega Carmen Rodríguez
US Silice 95 65 Oregon Chehalem Mountains Willamette Valley Pinot Noir Bergström
US Gap’s Crown Vineyard 95 60 California Sonoma Coast Sonoma Pinot Noir Blue Farm
Italy Ronco della Chiesa 95 80 Northeastern Italy Collio Friulano Borgo del Tiglio
US Estate Vineyard Wadensvil Block 95 48 Oregon Ribbon Ridge Willamette Valley Pinot Noir Patricia Green Cellars
US Weber Vineyard 95 48 Oregon Dundee Hills Willamette Valley Pinot Noir Patricia Green Cellars
France Château Montus Prestige 95 90 Southwest France Madiran Tannat Vignobles Brumont
US Grace Vineyard 95 185 Oregon Dundee Hills Willamette Valley Pinot Noir Domaine Serene
US Sigrid 95 90 Oregon Willamette Valley Willamette Valley Chardonnay Bergström
US Rainin Vineyard 95 325 California Diamond Mountain District Napa Cabernet Sauvignon Hall
Spain 6 Años Reserva Premium 95 80 Northern Spain Ribera del Duero Tempranillo Valduero
France Le Pigeonnier 95 290 Southwest France Cahors Malbec Château Lagrézette
US Gap’s Crown Vineyard 95 75 California Sonoma Coast Sonoma Pinot Noir Gary Farrell
US Grignolino 95 24 California Napa Valley Napa Rosé Heitz
Spain Prado Enea Gran Reserva 95 79 Northern Spain Rioja Tempranillo Blend Muga
Spain Termanthia 95 220 Northern Spain Toro Tinta de Toro Numanthia
US Giallo Solare 95 60 California Edna Valley Central Coast Chardonnay Center of Effort
US R-Bar-R Ranch 95 45 California Santa Cruz Mountains Central Coast Pinot Noir Comartin
New Zealand Maté’s Vineyard 94 57 Kumeu Chardonnay Kumeu River
US Shea Vineyard 94 62 Oregon Willamette Valley Pinot Noir Bergström
US Abetina 94 105 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
US Garys’ Vineyard 94 60 California Santa Lucia Highlands Central Coast Pinot Noir Roar
US The Funk Estate 94 60 Washington Walla Walla Valley (WA) Columbia Valley Syrah Saviah
Bulgaria Bergulé 90 15 Bulgaria Mavrud Villa Melnik
US Babushka 90 37 California Russian River Valley Sonoma Chardonnay Zepaltas
Italy Vigna Piaggia 90 NA Tuscany Brunello di Montalcino Sangiovese Abbadia Ardenga
France Nonpareil Trésor Rosé Brut 90 22 France Other Vin Mousseux Sparkling Blend Bouvet-Ladubay
US Conner Lee Vineyard 90 42 Washington Columbia Valley (WA) Columbia Valley Chardonnay Buty
Italy Riserva 90 135 Tuscany Brunello di Montalcino Sangiovese Carillon
France 90 60 Rhône Valley Châteauneuf-du-Pape Rhône-style White Blend Clos de L’Oratoire des Papes
Italy 90 29 Tuscany Vino Nobile di Montepulciano Sangiovese Avignonesi
Italy 90 23 Tuscany Chianti Classico Sangiovese Casina di Cornia
Italy Riserva 90 29 Tuscany Chianti Classico Red Blend Castello di Monterinaldi
Spain Amandi 90 17 Galicia Ribeira Sacra Mencía Don Bernardino
Spain Alfonso Oloroso Seco 90 26 Andalucia Jerez Palomino González Byass
US Private Reserve 90 55 Idaho Petite Sirah Huston
Italy Riserva 90 39 Tuscany Chianti Classico Red Blend Rignana
France Coteaux 90 69 Rhône Valley Cornas Syrah Tardieu-Laurent
Italy Vigneto Odoardo Beccari Riserva 90 30 Tuscany Chianti Classico Red Blend Vignavecchia
Italy Poggio alle Mura 90 90 Tuscany Brunello di Montalcino Sangiovese Banfi
US Estate Grown 90 60 California Mount Veeder Napa Cabernet Sauvignon Brandlin
Italy 90 50 Tuscany Brunello di Montalcino Sangiovese Brunelli Martoccia
US 90 40 Washington Red Mountain Columbia Valley Cabernet Sauvignon Canvasback
Italy Riserva 90 100 Tuscany Brunello di Montalcino Sangiovese Capanne Ricci
France 90 68 Burgundy Chassagne-Montrachet Chardonnay Chartron et Trébuchet
France Les 7 Hommes 90 42 Loire Valley Sancerre Sauvignon Blanc Cherrier Frères
France L’Inédit 90 28 Loire Valley Coteaux du Giennois Pinot Noir Clement et Florian Berthier
US 90 18 California Russian River Valley Sonoma Chardonnay De Loach
US Four Flags 90 69 Washington Red Mountain Columbia Valley Cabernet Sauvignon DeLille
France Le Pavé 90 NA Loire Valley Sancerre Sauvignon Blanc Domaine Vacheron
US Reserve 90 25 New York Finger Lakes Finger Lakes Riesling Dr. Konstantin Frank
US Final Final 90 30 Washington Columbia Valley (WA) Columbia Valley Cabernet Sauvignon-Syrah Efeste
Italy Poggio Bestiale 90 60 Tuscany Maremma Toscana Red Blend Fattoria di Magliano
Argentina The Apple Doesn’t Fall Far From The Tree 91 30 Mendoza Province Mendoza Malbec Matias Riccitelli
Australia 91 36 Victoria Mornington Peninsula Pinot Noir Moorooduc
Argentina Alegoría Gran Reserva 91 25 Mendoza Province Mendoza Malbec Navarro Correas
France L’Homme Mort Premier Cru 91 45 Burgundy Chablis Chardonnay Domaine Chenevières
Portugal 91 23 Alentejano Portuguese Red Herdade do Rocim
US Estate Select 91 36 California Santa Clara Valley Central Coast Syrah Jason-Stephens
France Fourchaume Premier Cru 91 38 Burgundy Chablis Chardonnay Louis Max
US Animo 91 85 California Napa Valley Napa Cabernet Sauvignon Michael Mondavi Family Estate
US Schindler Vineyard 91 50 Oregon Eola-Amity Hills Willamette Valley Pinot Noir Panther Creek
US Barrel Select 91 60 California Rutherford Napa Cabernet Sauvignon Provenance Vineyards
US District Collection 91 85 California St. Helena Napa Cabernet Sauvignon Raymond
US 91 45 California Sonoma Coast Sonoma Pinot Noir Red Car
Italy Bussia Riserva 91 NA Piedmont Barolo Nebbiolo Silvano Bolmida
US 91 19 Oregon Willamette Valley Willamette Valley Pinot Gris Trinity Vineyards
Portugal Premium 91 15 Alentejo Portuguese Red Adega Cooperativa de Borba
US Premier Cuvée 91 54 Oregon Willamette Valley Willamette Valley Pinot Noir Archery Summit
France Le Nombre d’Or Brut Nature 91 85 Champagne Champagne Chardonnay Aubry
US Juliana Vineyard 91 38 California Napa Valley Napa Sauvignon Blanc B Cellars
US 91 28 California Napa Valley Napa Cabernet Sauvignon B Side
Italy Boscato 91 75 Piedmont Barolo Nebbiolo Bel Colle
US Aeolian 91 42 Oregon Eola-Amity Hills Willamette Valley Pinot Noir Bethel Heights
Israel Reserve 91 25 Upper Galilee Cabernet Sauvignon Binyamina
Italy Palliano Riserva 91 NA Piedmont Roero Nebbiolo Ceste
Italy del Comune di Serralunga d’Alba 91 59 Piedmont Barolo Nebbiolo Cascina Cucco
Italy Bricco Luciani 91 85 Piedmont Barolo Nebbiolo Cascina del Monastero
Italy Bricco Gattera 91 80 Piedmont Barolo Nebbiolo Cordero di Montezemolo
France Montmains Premier Cru 91 45 Burgundy Chablis Chardonnay Domaine Gérard Duplessis
US 91 22 California Napa Valley Napa Cabernet Sauvignon Eagle Glen
US Bacigalupi Vineyard 91 65 California Russian River Valley Sonoma Pinot Noir Eleven Eleven
US Magnificat 91 50 California Napa Valley Napa Meritage Franciscan
US 86 10 California California California Other Cabernet Sauvignon Belle Ambiance
Portugal Marquês de Marialva Rosé Bruto 86 12 Beira Atlantico Baga Adega de Cantanhede
Italy Nature 86 22 Veneto Prosecco Glera De Stefani
US Small Lot Blend 86 13 California Mendocino County Mendocino/Lake Counties Chardonnay Parducci
Portugal Muros de Vinha 86 10 Douro Portuguese Red Quinta do Portal
France Château Beauvillain-Monpezat 86 14 Southwest France Cahors Malbec-Merlot Rigal
US 86 18 California California California Other Chardonnay The Naked Grape
US 86 36 California Sonoma Valley Sonoma Cabernet Sauvignon Tin Barn
France Pigmentum 86 15 Southwest France Buzet Merlot-Malbec Georges Vigouroux
France Pigmentum 86 10 Southwest France Côtes de Gascogne Ugni Blanc-Colombard Georges Vigouroux

Projection

Types

  • Principal Components Analysis (PCA)
  • Linear discriminant analysis
  • Kernel PCA

PCA

Example raw data

Modelling data

  • Looks normally distributed:

\[\begin{equation} (x_1,x_2)' \sim \mathcal{N}(\mu, \Sigma) \end{equation}\]

where \(\Sigma\) is dense.

  • Can we use this assumption to move to a more natural coordinate system? I.e.

\[\begin{equation} (y_1,y_2)' \sim \mathcal{N}(0, D) \end{equation}\]

where \(D\) is diagonal.

Example raw data

Assumed generative model

1st PC component axis

2nd PC component axis

How to obtain PC axes?

Remember, we’ve assumed:

\[\begin{equation} (x_1,x_2)' \sim \mathcal{N}(\mu, \Sigma) \end{equation}\]

  1. Centre data:

\[\begin{equation} (\tilde x_{1,i}, \tilde x_{2,i}) = (x_{1,i}, x_{2,i}) - (\bar x_{1}, \bar x_{2}) \end{equation}\]

  2. Estimate covariance matrix:

\[\begin{equation} \widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (\tilde x_{1,i}, \tilde x_{2,i})' (\tilde x_{1,i}, \tilde x_{2,i}) \end{equation}\]

  3. Eigendecompose:

\[\begin{equation} \widehat{\Sigma} = P D P' \end{equation}\]

  • \(P=[P_1, P_2]\) is matrix of eigenvectors of \(\widehat{\Sigma}\) representing PC directions:

\[\begin{align} y_1 &= P_1' (\tilde x_1, \tilde x_2)'\\ y_2 &= P_2' (\tilde x_1, \tilde x_2)' \end{align}\]

  • \(D\) is diagonal with eigenvalues as diagonal elements
  • eigenvalue magnitudes indicate the relative variance explained by each PC
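The three steps above can be sketched in NumPy on synthetic 2-D Gaussian data (all numbers illustrative):

```python
import numpy as np

# Synthetic correlated 2-D Gaussian data, as in the assumed generative model.
rng = np.random.default_rng(1)
x = rng.multivariate_normal([5.0, 2.0], [[2.0, 1.5], [1.5, 2.0]], size=500)

x_tilde = x - x.mean(axis=0)              # 1. centre the data
sigma_hat = x_tilde.T @ x_tilde / len(x)  # 2. estimate the covariance matrix
eigvals, P = np.linalg.eigh(sigma_hat)    # 3. eigendecompose: Sigma = P D P'

order = np.argsort(eigvals)[::-1]         # sort so PC1 explains the most variance
eigvals, P = eigvals[order], P[:, order]

y = x_tilde @ P                           # coordinates in the PC axes
print(np.round(eigvals, 1))               # eigenvalues: variance explained per PC
print(np.round(np.cov(y.T), 1))           # approximately diagonal
```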

Apply PCA to wine data

country designation points price province region_1 region_2 variety winery review_id
US Martha’s Vineyard 96 235 California Napa Valley Napa Cabernet Sauvignon Heitz 1
Spain Carodorum Selección Especial Reserva 96 110 Northern Spain Toro Tinta de Toro Bodega Carmen Rodríguez 2
US Special Selected Late Harvest 96 90 California Knights Valley Sonoma Sauvignon Blanc Macauley 3
US Reserve 96 65 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi 4
France La Brûlade 95 66 Provence Bandol Provence red blend Domaine de la Bégude 5
Spain Numanthia 95 73 Northern Spain Toro Tinta de Toro Numanthia 6
Spain San Román 95 65 Northern Spain Toro Tinta de Toro Maurodos 7
Spain Carodorum Único Crianza 95 110 Northern Spain Toro Tinta de Toro Bodega Carmen Rodríguez 8
US Silice 95 65 Oregon Chehalem Mountains Willamette Valley Pinot Noir Bergström 9
US Gap’s Crown Vineyard 95 60 California Sonoma Coast Sonoma Pinot Noir Blue Farm 10
Italy Ronco della Chiesa 95 80 Northeastern Italy Collio Friulano Borgo del Tiglio 11
US Estate Vineyard Wadensvil Block 95 48 Oregon Ribbon Ridge Willamette Valley Pinot Noir Patricia Green Cellars 12
US Weber Vineyard 95 48 Oregon Dundee Hills Willamette Valley Pinot Noir Patricia Green Cellars 13
France Château Montus Prestige 95 90 Southwest France Madiran Tannat Vignobles Brumont 14
US Grace Vineyard 95 185 Oregon Dundee Hills Willamette Valley Pinot Noir Domaine Serene 15
US Sigrid 95 90 Oregon Willamette Valley Willamette Valley Chardonnay Bergström 16
US Rainin Vineyard 95 325 California Diamond Mountain District Napa Cabernet Sauvignon Hall 17
Spain 6 Años Reserva Premium 95 80 Northern Spain Ribera del Duero Tempranillo Valduero 18
France Le Pigeonnier 95 290 Southwest France Cahors Malbec Château Lagrézette 19
US Gap’s Crown Vineyard 95 75 California Sonoma Coast Sonoma Pinot Noir Gary Farrell 20
US Grignolino 95 24 California Napa Valley Napa Rosé Heitz 21
Spain Prado Enea Gran Reserva 95 79 Northern Spain Rioja Tempranillo Blend Muga 22
Spain Termanthia 95 220 Northern Spain Toro Tinta de Toro Numanthia 23
US Giallo Solare 95 60 California Edna Valley Central Coast Chardonnay Center of Effort 24
US R-Bar-R Ranch 95 45 California Santa Cruz Mountains Central Coast Pinot Noir Comartin 25
New Zealand Maté’s Vineyard 94 57 Kumeu Chardonnay Kumeu River 26
US Shea Vineyard 94 62 Oregon Willamette Valley Pinot Noir Bergström 27
US Abetina 94 105 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi 28
US Garys’ Vineyard 94 60 California Santa Lucia Highlands Central Coast Pinot Noir Roar 29
US The Funk Estate 94 60 Washington Walla Walla Valley (WA) Columbia Valley Syrah Saviah 30
Bulgaria Bergulé 90 15 Bulgaria Mavrud Villa Melnik 31
US Babushka 90 37 California Russian River Valley Sonoma Chardonnay Zepaltas 32
Italy Vigna Piaggia 90 NA Tuscany Brunello di Montalcino Sangiovese Abbadia Ardenga 33
France Nonpareil Trésor Rosé Brut 90 22 France Other Vin Mousseux Sparkling Blend Bouvet-Ladubay 34
US Conner Lee Vineyard 90 42 Washington Columbia Valley (WA) Columbia Valley Chardonnay Buty 35
Italy Riserva 90 135 Tuscany Brunello di Montalcino Sangiovese Carillon 36
France 90 60 Rhône Valley Châteauneuf-du-Pape Rhône-style White Blend Clos de L’Oratoire des Papes 37
Italy 90 29 Tuscany Vino Nobile di Montepulciano Sangiovese Avignonesi 38
Italy 90 23 Tuscany Chianti Classico Sangiovese Casina di Cornia 39
Italy Riserva 90 29 Tuscany Chianti Classico Red Blend Castello di Monterinaldi 40
Spain Amandi 90 17 Galicia Ribeira Sacra Mencía Don Bernardino 41
Spain Alfonso Oloroso Seco 90 26 Andalucia Jerez Palomino González Byass 42
US Private Reserve 90 55 Idaho Petite Sirah Huston 43
Italy Riserva 90 39 Tuscany Chianti Classico Red Blend Rignana 44
France Coteaux 90 69 Rhône Valley Cornas Syrah Tardieu-Laurent 45
Italy Vigneto Odoardo Beccari Riserva 90 30 Tuscany Chianti Classico Red Blend Vignavecchia 46
Italy Poggio alle Mura 90 90 Tuscany Brunello di Montalcino Sangiovese Banfi 47
US Estate Grown 90 60 California Mount Veeder Napa Cabernet Sauvignon Brandlin 48
Italy 90 50 Tuscany Brunello di Montalcino Sangiovese Brunelli Martoccia 49
US 90 40 Washington Red Mountain Columbia Valley Cabernet Sauvignon Canvasback 50
Italy Riserva 90 100 Tuscany Brunello di Montalcino Sangiovese Capanne Ricci 51
France 90 68 Burgundy Chassagne-Montrachet Chardonnay Chartron et Trébuchet 52
France Les 7 Hommes 90 42 Loire Valley Sancerre Sauvignon Blanc Cherrier Frères 53
France L’Inédit 90 28 Loire Valley Coteaux du Giennois Pinot Noir Clement et Florian Berthier 54
US 90 18 California Russian River Valley Sonoma Chardonnay De Loach 55
US Four Flags 90 69 Washington Red Mountain Columbia Valley Cabernet Sauvignon DeLille 56
France Le Pavé 90 NA Loire Valley Sancerre Sauvignon Blanc Domaine Vacheron 57
US Reserve 90 25 New York Finger Lakes Finger Lakes Riesling Dr. Konstantin Frank 58
US Final Final 90 30 Washington Columbia Valley (WA) Columbia Valley Cabernet Sauvignon-Syrah Efeste 59
Italy Poggio Bestiale 90 60 Tuscany Maremma Toscana Red Blend Fattoria di Magliano 60
Argentina The Apple Doesn’t Fall Far From The Tree 91 30 Mendoza Province Mendoza Malbec Matias Riccitelli 61
Australia 91 36 Victoria Mornington Peninsula Pinot Noir Moorooduc 62
Argentina Alegoría Gran Reserva 91 25 Mendoza Province Mendoza Malbec Navarro Correas 63
France L’Homme Mort Premier Cru 91 45 Burgundy Chablis Chardonnay Domaine Chenevières 64
Portugal 91 23 Alentejano Portuguese Red Herdade do Rocim 65
US Estate Select 91 36 California Santa Clara Valley Central Coast Syrah Jason-Stephens 66
France Fourchaume Premier Cru 91 38 Burgundy Chablis Chardonnay Louis Max 67
US Animo 91 85 California Napa Valley Napa Cabernet Sauvignon Michael Mondavi Family Estate 68
US Schindler Vineyard 91 50 Oregon Eola-Amity Hills Willamette Valley Pinot Noir Panther Creek 69
US Barrel Select 91 60 California Rutherford Napa Cabernet Sauvignon Provenance Vineyards 70
US District Collection 91 85 California St. Helena Napa Cabernet Sauvignon Raymond 71
US 91 45 California Sonoma Coast Sonoma Pinot Noir Red Car 72
Italy Bussia Riserva 91 NA Piedmont Barolo Nebbiolo Silvano Bolmida 73
US 91 19 Oregon Willamette Valley Willamette Valley Pinot Gris Trinity Vineyards 74
Portugal Premium 91 15 Alentejo Portuguese Red Adega Cooperativa de Borba 75
US Premier Cuvée 91 54 Oregon Willamette Valley Willamette Valley Pinot Noir Archery Summit 76
France Le Nombre d’Or Brut Nature 91 85 Champagne Champagne Chardonnay Aubry 77
US Juliana Vineyard 91 38 California Napa Valley Napa Sauvignon Blanc B Cellars 78
US 91 28 California Napa Valley Napa Cabernet Sauvignon B Side 79
Italy Boscato 91 75 Piedmont Barolo Nebbiolo Bel Colle 80
US Aeolian 91 42 Oregon Eola-Amity Hills Willamette Valley Pinot Noir Bethel Heights 81
Israel Reserve 91 25 Upper Galilee Cabernet Sauvignon Binyamina 82
Italy Palliano Riserva 91 NA Piedmont Roero Nebbiolo Ceste 83
Italy del Comune di Serralunga d’Alba 91 59 Piedmont Barolo Nebbiolo Cascina Cucco 84
Italy Bricco Luciani 91 85 Piedmont Barolo Nebbiolo Cascina del Monastero 85
Italy Bricco Gattera 91 80 Piedmont Barolo Nebbiolo Cordero di Montezemolo 86
France Montmains Premier Cru 91 45 Burgundy Chablis Chardonnay Domaine Gérard Duplessis 87
US 91 22 California Napa Valley Napa Cabernet Sauvignon Eagle Glen 88
US Bacigalupi Vineyard 91 65 California Russian River Valley Sonoma Pinot Noir Eleven Eleven 89
US Magnificat 91 50 California Napa Valley Napa Meritage Franciscan 90
US 86 10 California California California Other Cabernet Sauvignon Belle Ambiance 91
Portugal Marquês de Marialva Rosé Bruto 86 12 Beira Atlantico Baga Adega de Cantanhede 92
Italy Nature 86 22 Veneto Prosecco Glera De Stefani 93
US Small Lot Blend 86 13 California Mendocino County Mendocino/Lake Counties Chardonnay Parducci 94
Portugal Muros de Vinha 86 10 Douro Portuguese Red Quinta do Portal 95
France Château Beauvillain-Monpezat 86 14 Southwest France Cahors Malbec-Merlot Rigal 96
US 86 18 California California California Other Chardonnay The Naked Grape 97
US 86 36 California Sonoma Valley Sonoma Cabernet Sauvignon Tin Barn 98
France Pigmentum 86 15 Southwest France Buzet Merlot-Malbec Georges Vigouroux 99
France Pigmentum 86 10 Southwest France Côtes de Gascogne Ugni Blanc-Colombard Georges Vigouroux 100

Pick price and points

price points
235 96
110 96
90 96
65 96
66 95
73 95
65 95
110 95
65 95
60 95
80 95
48 95
48 95
90 95
185 95
90 95
325 95
80 95
290 95
75 95
24 95
79 95
220 95
60 95
45 95
57 94
62 94
105 94
60 94
60 94
15 90
37 90
NA 90
22 90
42 90
135 90
60 90
29 90
23 90
29 90
17 90
26 90
55 90
39 90
69 90
30 90
90 90
60 90
50 90
40 90
100 90
68 90
42 90
28 90
18 90
69 90
NA 90
25 90
30 90
60 90
30 91
36 91
25 91
45 91
23 91
36 91
38 91
85 91
50 91
60 91
85 91
45 91
NA 91
19 91
15 91
54 91
85 91
38 91
28 91
75 91
42 91
25 91
NA 91
59 91
85 91
80 91
45 91
22 91
65 91
50 91
10 86
12 86
22 86
13 86
10 86
14 86
18 86
36 86
15 86
10 86

Plot price and points

Apply PCA

What went wrong?

  • PCA assumes the data are multivariate normal
  • the price data are very non-normal: heavily right-skewed

Take log transform of price
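Why the log transform helps can be sketched on synthetic, price-like (log-normal) data; the numbers here are illustrative, not the wine data itself:

```python
import numpy as np

rng = np.random.default_rng(2)
price = rng.lognormal(mean=3.8, sigma=0.7, size=1000)  # heavy right tail, like wine prices

def skewness(v):
    """Sample skewness: third moment of the standardised values."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

print(round(skewness(price), 1))          # strongly positive: long right tail
print(round(skewness(np.log(price)), 1))  # near zero: far closer to normal
```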

Reapply PCA

Variance explained by each component

What do PCs mean?

Project data onto first two PCs

Sparse PCA

  • PCA components typically load on all variables, which makes it tricky to assign meaning to them
  • sparse PCA methods penalise the count or magnitude of non-zero loadings on the input variables \(\implies\) can lead to more interpretable components
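A hedged sketch of the effect, using scikit-learn's `SparsePCA` on synthetic data (assuming scikit-learn is available; parameters illustrative):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Four observed variables driven by two latent factors: variables (0, 1)
# track one factor, variables (2, 3) the other.
rng = np.random.default_rng(8)
latent = rng.normal(size=(200, 2))
x = np.column_stack([latent[:, 0], latent[:, 0] + 0.1 * rng.normal(size=200),
                     latent[:, 1], latent[:, 1] + 0.1 * rng.normal(size=200)])

# The L1 penalty (alpha) drives many loadings towards exactly zero, so each
# component tends to involve only a subset of the variables.
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(x)
print(np.round(spca.components_, 2))
```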

Plot sparse loadings

PCA: summary

  • PCA is a projection method using linear transformations: collectively those linear transformations represent a rotation of coordinate axes
  • PCA assumes data are multivariate normal
    • if not a good approximation, components can be poor representation
  • works ok for simple data; less well for non-linear features

Questions?

Manifold learning

What is it?

Essentially these methods try to learn the structure of the local manifolds in the data. (Also referred to as graph-based or embedding methods.)

  • t-distributed stochastic neighbour embedding (t-SNE)
  • uniform manifold approximation and projection (UMAP)

t-distributed stochastic neighbour embedding

High level detail

  • original data live in a “data space” \(\mathbb{R}^k\), where \(k\) is typically large
  • t-SNE transforms the data non-linearly to a “map space”, typically \(\mathbb{R}^2\) or \(\mathbb{R}^3\)
  • the mapping is constructed such that points close in “data space” are also close in “map space”

Mapping tries to preserve local neighbours

Similarity in data space: fuzzy neighbours

  • imagine a Gaussian density around each point and define a conditional similarity:

\[\begin{equation} p_{j|i} = \frac{\exp(-|x_i-x_j|^2/2\sigma_i^2)}{\sum_{k\neq i} \exp(-|x_i-x_k|^2/2\sigma_i^2)} \end{equation}\]

where \(\sigma_i\) is different for each point: points in dense areas given smaller variance than those in sparse areas

  • \(p_{j|i}\) measures the probability that \(x_i\) would pick \(x_j\) as its neighbour

Symmetrised similarity

Need a symmetric similarity, so define:

\[\begin{equation} p_{ij} = \frac{1}{2N} (p_{j|i} + p_{i|j}) \end{equation}\]

yielding a similarity matrix with elements \(p_{ij}\)
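The construction of \(p_{ij}\) can be sketched in NumPy, using a single shared \(\sigma\) for clarity (t-SNE itself tunes \(\sigma_i\) per point via the perplexity):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(6, 2))   # toy "data space" points
sigma, N = 1.0, len(x)

d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)  # |x_i - x_j|^2
aff = np.exp(-d2 / (2.0 * sigma ** 2))
np.fill_diagonal(aff, 0.0)                      # a point is not its own neighbour
p_cond = aff / aff.sum(axis=1, keepdims=True)   # p_{j|i}: each row sums to one
p = (p_cond + p_cond.T) / (2 * N)               # symmetrise; all entries sum to one
print(round(float(p.sum()), 6))                 # 1.0
```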

Similarity in map space

Uses Student-t distribution with one degree of freedom (the Cauchy):

\[\begin{equation} q_{ij} = \frac{1/(1+|z_i-z_j|^2)}{\sum_{k\neq l} 1/(1+|z_k-z_l|^2)} \end{equation}\]

How to choose \(z_i\)?

Algorithm to choose \(z_i\)

We can calculate the Kullback-Leibler divergence (not a distance!) from \(p_{ij}\rightarrow q_{ij}\):

\[\begin{equation} \text{KL} = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \end{equation}\]

Minimise via gradient descent using \(\frac{\partial \text{KL}}{\partial z_i}\)

Picking \(\sigma_i\)

  • more uncertainty (i.e. entropy) if \(\sigma_i\uparrow\)
  • user specifies a “perplexity” hyperparameter which effectively determines number of nearest neighbours
  • t-SNE chooses \(\sigma_i\) to meet this perplexity for each data point
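In practice one rarely implements this by hand; a hedged sketch with scikit-learn's `TSNE` (assuming scikit-learn is available) on synthetic clustered data shows where perplexity enters:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic, well-separated clusters in 10-D "data space",
# embedded at two perplexity values to show the hyperparameter's role.
rng = np.random.default_rng(4)
x = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 10)) for c in (0.0, 5.0)])

for perplexity in (5, 25):
    z = TSNE(n_components=2, perplexity=perplexity, init="pca",
             random_state=0).fit_transform(x)
    print(perplexity, z.shape)   # a (60, 2) embedding in "map space"
```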

Applying t-SNE to wine data

Perplexity=1

Perplexity=100

UMAP

What is UMAP?

An embedding method like t-SNE, with similar steps:

  • constructs a (fuzzy) nearest neighbour graph
  • optimises a lower dimensional representation of this graph

Nearest neighbour graph construction: simplicial complexes

Low dimensional graph construction

Minimise the cross-entropy between the two graphs, treating each edge weight as the parameter of a Bernoulli distribution:

\[\begin{equation} \sum_e w_h(e) \log \frac{w_h(e)}{w_l(e)} + (1 - w_h(e)) \log \frac{1-w_h(e)}{1-w_l(e)} \end{equation}\]

where \(w_h\) are high dimensional weights; \(w_l\) are low dimensional weights.
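A small numeric sketch of this objective, with hypothetical edge weights:

```python
import numpy as np

def fuzzy_cross_entropy(w_h, w_l):
    """Cross-entropy above: each edge treated as a Bernoulli variable."""
    return float(np.sum(w_h * np.log(w_h / w_l)
                        + (1.0 - w_h) * np.log((1.0 - w_h) / (1.0 - w_l))))

w_h = np.array([0.9, 0.5, 0.1])          # high dimensional edge weights (hypothetical)
w_close = np.array([0.85, 0.55, 0.15])   # layout that matches the graph well
w_far = np.array([0.10, 0.90, 0.90])     # layout that disagrees badly

# The closer layout scores a lower cross-entropy, so optimisation prefers it.
print(fuzzy_cross_entropy(w_h, w_close) < fuzzy_cross_entropy(w_h, w_far))  # True
```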

UMAP hyperparameters

  • n_neighbors: the number of approximate nearest neighbours used to construct the high dimensional graph. Low values \(\implies\) emphasis on local structure; high values \(\implies\) more global structure but slower run time
  • min_dist: the minimum allowed distance between points in the low dimensional embedding. Low values produce tighter clumps; high values give a more evenly spread embedding

Applying UMAP to wine data

High min_dist

Low min_dist

Summary t-SNE and UMAP

  • both are manifold / embedding / graph methods
  • t-SNE typically much slower
  • global structure preserved better in UMAP
  • hyperparameters really matter for both

Results from various dim reduction methods

Questions?

Clustering

What does clustering aim to achieve?

K-means clustering

Idea:

  • set the number of clusters, \(k\), a priori
  • determine cluster centres such that the total (squared) distance from points to their assigned centres is minimised

K-means clustering

How to achieve k-means

  1. randomly select \(k\) data points and make these the initial centres
  2. compute the distance of every point from each centre
  3. assign each point to its nearest centre
  4. recompute each centre as the mean of the points in its cluster
  5. repeat 2-4 until the clusters stop changing (or a maximum number of iterations is reached)
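Steps 1-5 translate directly into NumPy; a sketch on synthetic two-blob data:

```python
import numpy as np

def k_means(x, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)]  # 1. random data points as centres
    labels = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)  # 2. distances
        labels = d.argmin(axis=1)                                        # 3. assign to nearest centre
        new_centres = np.array([x[labels == j].mean(axis=0)
                                for j in range(k)])                      # 4. recompute centres
        if np.allclose(new_centres, centres):                            # 5. stop when unchanged
            break
        centres = new_centres
    return centres, labels

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(5)
x = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centres, labels = k_means(x, k=2)
print(np.round(centres))   # near (0, 0) and (5, 5), in some order
```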

Problem with k-means

  • assumes clusters are spherical (i.e. multivariate Gaussian with covariance proportional to the identity)
  • requires the number of clusters to be hard-coded
  • not model-based, so it is hard to determine the optimal number of clusters

Gaussian mixture models

Model-based clustering

data assumed to be generated by a mixture of Gaussians:

\[\begin{align} p(x_i) &= \sum_k p(x_i|z_i=k) p(z_i=k)\\ &= \sum_k \mathcal{N}(x_i|\mu_k, \Sigma_k) \pi_k \end{align}\]

where \(\mu_k\) and \(\Sigma_k\) are the mean and covariance matrix of the Gaussian component corresponding to cluster \(k\); \(\pi_k\) is the mixture proportion

Generative model

  1. roll a \(K\)-sided die with face \(j\) weighted by \(\pi_j\) \(\implies\) cluster \(j\)
  2. draw \(x_i\sim\mathcal{N}(\mu_j, \Sigma_j)\)

repeat steps 1 and 2 to build up a sample
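The two generative steps are easy to sketch in NumPy (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
pi = np.array([0.3, 0.7])                           # mixture proportions
mu = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # cluster means
Sigma = [np.eye(2) * 0.5, np.eye(2) * 1.0]          # cluster covariances

def draw(n):
    z = rng.choice(len(pi), size=n, p=pi)           # 1. roll the weighted die
    x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])  # 2. Gaussian draw
    return x, z

x, z = draw(1000)
print(round(float(np.mean(z == 1)), 2))   # close to pi[1] = 0.7
```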

How to fit

  • since \(z_i\) are typically unobserved (this is the point of clustering), maximum likelihood estimation is hard
  • instead expectation-maximisation (EM) algorithm is used
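A hedged sketch of EM fitting with scikit-learn's `GaussianMixture` (assuming scikit-learn is available) on synthetic two-blob data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data from two Gaussian blobs; all parameter values are illustrative.
rng = np.random.default_rng(7)
x = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(4.0, 0.8, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)  # EM under the hood
print(np.round(gmm.means_))        # near (0, 0) and (4, 4), in some order
print(np.round(gmm.weights_, 1))   # near [0.5, 0.5]
```

Comparing `gmm.bic(x)` across fits with different `n_components` is one standard way to do the model selection over cluster numbers mentioned below.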

Revisiting star data

Summary of GMMs

  • better than k-means: model-based and allows general (non-spherical) Gaussians
  • still requires number of clusters to be specified a priori, although model selection can take place to choose optimal cluster number
  • do we actually believe there is fixed cluster number? Dirichlet process mixture models!

Summary

  • unsupervised and supervised learning aim to achieve different goals
  • dimensionality reduction is one variety of unsupervised learning
  • PCA is a linear projection method
  • t-SNE and UMAP are (non-linear) manifold learning methods that can better capture lower dimensional structure

  • clustering methods reduce data down to a single, categorical dimension (the cluster label)
  • k-means works for simple datasets
  • GMMs are model-based and more general (but still have flaws)

Questions?